On Retrieving Legal Files: Shortening Documents and Weeding Out Garbage

Authors

  • Scott Kulp
  • April Kontostathis
Abstract

This paper describes our participation in the TREC Legal experiments in 2007. We applied novel normalization techniques designed to slightly favor longer documents rather than assuming that all documents should have equal weight. We also developed a new method for reformulating query text when background information is provided with an information request, and we experimented with enhanced OCR error detection to reduce the size of the term list and remove noise from the data. In this article, we discuss the impact of these techniques on the TREC 2007 data sets. We show that simple normalization methods significantly outperform cosine normalization in the legal domain.
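The abstract does not give the normalization formulas, so the following is only a rough sketch of the general idea under assumed definitions. It contrasts standard cosine (L2) normalization with a hypothetical log-of-length divisor; `log_length_normalize`, the toy documents, and the scoring function are illustrative assumptions, not the authors' method.

```python
import math

def cosine_normalize(tf):
    """Divide each term weight by the vector's Euclidean (L2) length."""
    norm = math.sqrt(sum(w * w for w in tf.values()))
    return {t: w / norm for t, w in tf.items()} if norm else dict(tf)

def log_length_normalize(tf):
    """Assumed alternative: divide by the log of the total term count.
    The divisor grows slowly, so long documents are penalized less."""
    length = sum(tf.values())
    divisor = math.log(length) if length > 1 else 1.0
    return {t: w / divisor for t, w in tf.items()}

def score(query_terms, weights):
    """Sum the normalized weights of the query terms found in the document."""
    return sum(weights.get(t, 0.0) for t in query_terms)

# Toy raw term frequencies (made up for illustration).
short_doc = {"contract": 2, "breach": 1}
long_doc = {"contract": 6, "breach": 3, "tobacco": 40, "memo": 51}
query = ["contract", "breach"]

# Score of the long document relative to the short one under each scheme.
ratio_cosine = (score(query, cosine_normalize(long_doc))
                / score(query, cosine_normalize(short_doc)))
ratio_log = (score(query, log_length_normalize(long_doc))
             / score(query, log_length_normalize(short_doc)))

print(f"long/short score ratio, cosine: {ratio_cosine:.3f}")
print(f"long/short score ratio, log length: {ratio_log:.3f}")
```

In this toy setup the long document keeps a larger share of its score under the log divisor than under cosine normalization, which is the kind of effect the abstract means by slightly favoring longer documents.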

Similar resources

Improving Search and Retrieval Performance through Shortening Documents, Detecting Garbage, and Throwing Out Jargon

This thesis describes the development of a new search and retrieval system used to index and process queries for several different data sets of documents. This thesis also describes my work with the TREC Legal data set, in particular, the new algorithms I designed to improve recall and precision rates in the legal domain. I have applied novel normalization techniques that are designed to slight...

Investigating Legal Loopholes in the Field of Official Documents in Iran and its Ethical Implications

Background: In the Law on registration of deeds and real estate, the definition of official document and the scope of inclusion of official documents are different from civil law, and these definitions create different interpretations and effects in society and how to deal with legal issues and problems. Resolving legal deficiencies in answering accidental questions that occur in the community,...

Legal Documents Clustering using Latent Dirichlet Allocation

At present, the availability of a large number of legal judgments in digital form creates opportunities and challenges for both the legal community and information technology researchers. This development needs assistance in organizing, analyzing, retrieving and presenting this content in a helpful and distributed manner. We propose an approach to cluster legal judgments based on th...

Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents

We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents which are available in the online version of the publication Belgisch Staatsblad/Moniteur belge, and which have been published between 1997 and 2006. After downloading files in batch, we filtered out documents which have no translation in the other language, documents which contain sever...

Categorisation by Context

Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material, and therefore it will be necessary to resort to techniques for automatic classification of...


Publication date: 2007